ReBNN: Resilient Binary Neural Network
in which $\gamma_i^n$ is a balanced parameter. Based on the objective, the weight gradient in Eq. (3.141) becomes:
\[
\begin{aligned}
\delta_{w_i^n} &= \frac{\partial L}{\partial w_i^n} + \gamma_i^n \left( w_i^n - \alpha_i^n b_{w_i^n} \right) \\
&= \alpha_i^n \left( \frac{\partial L}{\partial \hat{w}_i^n} \circledast \mathbf{1}_{|w_i^n| \le 1} - \gamma_i^n b_{w_i^n} \right) + \gamma_i^n w_i^n .
\end{aligned}
\tag{3.144}
\]
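The gradient in Eq. (3.144) can be sketched numerically. The following is a minimal illustration, not the authors' implementation; the function and variable names (`resilient_grad`, `grad_what`, etc.) are assumptions chosen for readability:

```python
import numpy as np

# Sketch of the resilient gradient delta_w in Eq. (3.144):
#   delta_w = alpha * (dL/d(w_hat) * 1_{|w|<=1} - gamma * b_w) + gamma * w
# where b_w = sign(w) is the binarized weight and 1_{|w|<=1} is the
# straight-through-estimator mask.
def resilient_grad(w, alpha, gamma, grad_what):
    b_w = np.sign(w)                  # binarized weight b_w
    ste_mask = np.abs(w) <= 1.0       # straight-through estimator 1_{|w|<=1}
    return alpha * (grad_what * ste_mask - gamma * b_w) + gamma * w

w = np.array([-0.4, 0.2, 1.5])
g = resilient_grad(w, alpha=0.5, gamma=0.1, grad_what=np.array([0.3, -0.2, 0.1]))
```

Note that for the third weight (|w| > 1) the upstream gradient is masked out by the STE, yet the additional term $\gamma w$ still contributes, which is exactly what keeps the update from vanishing when $\alpha$ is small.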
The term $S_i^n(\alpha_i^n, w_i^n) = \gamma_i^n (w_i^n - \alpha_i^n b_{w_i^n})$ is an additional term added in the backpropagation process. We add this term because a too-small $\alpha_i^n$ diminishes the gradient $\delta_{w_i^n}$ and leaves the weight $w_i^n$ constant. In what follows, we state and prove the proposition that $\delta_{w_{i,j}^n}$ is a resilient gradient for a single weight $w_{i,j}^n$. For ease of presentation, we sometimes omit the subscript $i,j$ and the superscript $n$.
Proposition 1. The additional term $S(\alpha, w) = \gamma (w - \alpha b_w)$ achieves a resilient training process by suppressing frequent weight oscillation. Its balanced factor $\gamma$ can be considered the parameter that controls the occurrence of the weight oscillation.
Proof: We prove the proposition by contradiction. For a single weight $w$ centering around zero, the straight-through estimator $\mathbf{1}_{|w| \le 1} = 1$; thus, we omit it in the following. Based on Eq. (3.144), with a learning rate $\eta$, the weight updating process is formulated as:
\[
\begin{aligned}
w^{t+1} &= w^t - \eta \, \delta_{w^t} \\
&= w^t - \eta \left[ \alpha^t \left( \frac{\partial L}{\partial \hat{w}^t} - \gamma b_{w^t} \right) + \gamma w^t \right] \\
&= (1 - \eta\gamma)\, w^t - \eta \alpha^t \left( \frac{\partial L}{\partial \hat{w}^t} - \gamma b_{w^t} \right) \\
&= (1 - \eta\gamma) \left[ w^t - \frac{\eta \alpha^t}{1 - \eta\gamma} \left( \frac{\partial L}{\partial \hat{w}^t} - \gamma b_{w^t} \right) \right],
\end{aligned}
\tag{3.145}
\]
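The update in Eq. (3.145) decays the latent weight by $(1 - \eta\gamma)$ and then applies the rescaled gradient step. A minimal numerical sketch, assuming the STE mask is active ($|w| \le 1$) as in the proof; all names and constants are illustrative:

```python
import numpy as np

# Sketch of one update step from Eq. (3.145):
#   w^{t+1} = (1 - eta*gamma) * w^t - eta * alpha^t * (dL/d(w_hat) - gamma * b_w)
def update_weight(w, alpha, gamma, eta, grad_what):
    b_w = np.sign(w)  # binarized weight b_w at iteration t
    return (1.0 - eta * gamma) * w - eta * alpha * (grad_what - gamma * b_w)

w = -0.3  # latent weight with b_w = -1
w_next = update_weight(w, alpha=0.5, gamma=0.1, eta=0.01, grad_what=0.2)
```

With $b_{w^t} = -1$, the term $-\gamma b_{w^t} = +\gamma$ pushes the update toward keeping the sign unless the upstream gradient is strong enough to overcome it, which is the mechanism the following probability bounds formalize.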
where $t$ denotes the $t$-th training iteration. Different weights lie at different distances from the quantization levels $\pm 1$; therefore, their gradients should be modified according to their scaling factors and the current learning rate. We first assume the initial state $b_{w^t} = -1$; the analysis applies equally to the case $b_{w^t} = 1$. The oscillation probability from iteration $t$ to $t+1$ is the following:
\[
P\left( b_{w^t} \neq b_{w^{t+1}} \right) \Big|_{b_{w^t} = -1} \le P\left( \frac{\partial L}{\partial \hat{w}^t} \le -\gamma \right).
\tag{3.146}
\]
Similarly, the oscillation probability from iteration $t+1$ to $t+2$ is as follows:
\[
P\left( b_{w^{t+1}} \neq b_{w^{t+2}} \right) \Big|_{b_{w^{t+1}} = 1} \le P\left( \frac{\partial L}{\partial \hat{w}^{t+1}} \ge \gamma \right).
\tag{3.147}
\]
Thus, the sequential oscillation probability from iteration $t$ to $t+2$ is as follows:
\[
P\left( \left( b_{w^t} \neq b_{w^{t+1}} \right) \cap \left( b_{w^{t+1}} \neq b_{w^{t+2}} \right) \right) \Big|_{b_{w^t} = -1} \le P\left( \left( \frac{\partial L}{\partial \hat{w}^t} \le -\gamma \right) \cap \left( \frac{\partial L}{\partial \hat{w}^{t+1}} \ge \gamma \right) \right),
\tag{3.148}
\]
which shows that a sequential weight oscillation occurs only if the magnitudes of both $\frac{\partial L}{\partial \hat{w}^t}$ and $\frac{\partial L}{\partial \hat{w}^{t+1}}$ exceed $\gamma$. As a result, the attached factor $\gamma$ can be considered a parameter that controls the occurrence of the weight oscillation.
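The suppression effect can be observed empirically. The following small simulation is not from the original text: it drives the update of Eq. (3.145) with Gaussian gradient noise and counts sign flips of $b_w$. The setup (noise scale, step count, constants) is an assumption chosen purely for illustration:

```python
import numpy as np

# Illustrative simulation: with noisy gradients, the events
# dL/d(w_hat) <= -gamma and dL/d(w_hat) >= +gamma become rarer as gamma
# grows, so the binarized weight b_w = sign(w) flips less often.
def count_oscillations(gamma, steps=10000, eta=0.1, alpha=1.0, sigma=0.5, seed=0):
    rng = np.random.default_rng(seed)
    w, flips, prev_sign = 0.01, 0, 1.0
    for _ in range(steps):
        g = rng.normal(0.0, sigma)  # noisy surrogate for dL/d(w_hat)
        # update rule of Eq. (3.145), STE mask assumed active
        w = (1 - eta * gamma) * w - eta * alpha * (g - gamma * np.sign(w))
        s = np.sign(w)
        if s != prev_sign:
            flips += 1
        prev_sign = s
    return flips

few = count_oscillations(gamma=1.0)   # resilient term active
many = count_oscillations(gamma=0.0)  # plain update, no additional term
```

With $\gamma = 0$ the update degenerates to a drift-free random walk whose sign flips freely, while a nonzero $\gamma$ both decays the latent weight toward a stable magnitude and raises the gradient threshold needed for a flip, matching the bound in Eq. (3.148).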